hls4ml: A Flexible, Open-Source Platform for Deep Learning Acceleration on Reconfigurable Hardware
Schulte, Jan-Frederik, Ramhorst, Benjamin, Sun, Chang, Mitrevski, Jovan, Ghielmetti, Nicolò, Lupi, Enrico, Danopoulos, Dimitrios, Loncar, Vladimir, Duarte, Javier, Burnette, David, Laatu, Lauri, Tzelepis, Stylianos, Axiotis, Konstantinos, Berthet, Quentin, Wang, Haoyan, White, Paul, Demirsoy, Suleyman, Colombo, Marco, Aarrestad, Thea, Summers, Sioni, Pierini, Maurizio, Di Guglielmo, Giuseppe, Ngadiuba, Jennifer, Campos, Javier, Hawks, Ben, Gandrakota, Abhijith, Fahim, Farah, Tran, Nhan, Constantinides, George, Que, Zhiqiang, Luk, Wayne, Tapper, Alexander, Hoang, Duc, Paladino, Noah, Harris, Philip, Lai, Bo-Cheng, Valentin, Manuel, Forelli, Ryan, Ogrenci, Seda, Gerlach, Lino, Flynn, Rian, Liu, Mia, Diaz, Daniel, Khoda, Elham, Quinnan, Melissa, Solares, Russell, Parajuli, Santosh, Neubauer, Mark, Herwig, Christian, Tsoi, Ho Fung, Rankin, Dylan, Hsu, Shih-Chieh, Hauck, Scott
We present hls4ml, a free and open-source platform that translates machine learning (ML) models from modern deep learning frameworks into high-level synthesis (HLS) code that can be integrated into full designs for field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). With its flexible and modular design, hls4ml supports a large number of deep learning frameworks and can target HLS compilers from several vendors, including Vitis HLS, Intel oneAPI, and Catapult HLS. Together with a wider ecosystem for software-hardware co-design, hls4ml has enabled the acceleration of ML inference in a wide range of commercial and scientific applications where low latency, resource usage, and power consumption are critical. In this paper, we describe the structure and functionality of the hls4ml platform. The overarching design considerations for the generated HLS code are discussed, together with selected performance results.
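HLS code of the kind hls4ml generates typically stores weights and activations as fixed-point `ap_fixed<W,I>` values rather than floats. As an illustrative sketch (an emulation in Python, not hls4ml's actual implementation), the rounding and saturation behavior of such a type looks like this:

```python
def ap_fixed(value, total_bits, int_bits):
    """Emulate HLS ap_fixed<W,I> quantization: I integer bits (including
    sign), W-I fractional bits, saturating at the representable range."""
    frac_bits = total_bits - int_bits
    scale = 1 << frac_bits
    # Round to the nearest representable step.
    q = round(value * scale)
    # Saturate to the signed W-bit two's-complement range.
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, q))
    return q / scale

# A weight of 0.7 stored as ap_fixed<8,2> keeps 6 fractional bits.
print(ap_fixed(0.7, 8, 2))   # 0.703125
```

The trade-off between `W`, `I`, and model accuracy is one of the central tuning knobs when mapping a trained network onto an FPGA.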
Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control
Kresse, Fabian, Lampert, Christoph H.
We investigate whether continuous-control policies can be represented and learned as discrete logic circuits instead of continuous neural networks. We introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that maps real-valued observations to actions using thermometer-encoded inputs, sparsely connected boolean lookup-table layers, and lightweight action heads. DWCs can be trained end-to-end by gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with weight-based policies (full precision or quantized neural networks), matching performance on four tasks and isolating network capacity as the key limiting factor on HalfCheetah. Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling a direct inspection of which input thresholds influence control decisions.
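The thermometer encoding the abstract mentions maps a real value to a monotone bit pattern, so that nearby values share prefix bits and ordering is preserved. A minimal sketch (the thresholds below are toy values, not the paper's):

```python
def thermometer_encode(x, thresholds):
    """Thermometer encoding: one bit per threshold, set iff x exceeds it.
    Unlike one-hot encoding, the result is monotone in x."""
    return [1 if x > t else 0 for t in thresholds]

thresholds = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(thermometer_encode(0.3, thresholds))   # [1, 1, 1, 0, 0]
print(thermometer_encode(-0.7, thresholds))  # [1, 0, 0, 0, 0]
```

These bit patterns then feed the sparsely connected boolean lookup-table layers, which is what makes the compiled circuit multiplier-free.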
JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs
Que, Zhiqiang, Sun, Chang, Paramesvaran, Sudarshan, Clement, Emyr, Karakoulaki, Katerina, Brown, Christopher, Laatu, Lauri, Cox, Arianna, Tapper, Alexander, Luk, Wayne, Spiropulu, Maria
Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply. In this work, we propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions by leveraging shared transformations and global aggregation. To further enhance hardware efficiency, we introduce fine-grained quantization-aware training with per-parameter bitwidth optimization and employ multiplier-free multiply-accumulate operations via distributed arithmetic. Evaluation results show that our FPGA-based JEDI-linear achieves 3.7 to 11.5 times lower latency, up to 150 times lower initiation interval, and up to 6.2 times lower LUT usage compared to state-of-the-art GNN designs, while also delivering higher model accuracy and eliminating the need for DSP blocks entirely. This is the first interaction-based GNN to achieve less than 60 ns latency, and it currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system. This work advances next-generation trigger systems by enabling accurate, scalable, and resource-efficient GNN inference in real-time environments. Our open-sourced templates will further support reproducibility and broader adoption across scientific applications.
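The shared-transformation-plus-global-aggregation idea that gives JEDI-linear its linear complexity can be sketched generically. Instead of computing a message for each of the N² node pairs, each node is transformed once, the results are summed globally, and each node is then combined with the aggregate (the function names `phi` and `rho` below are illustrative, not the paper's):

```python
def linear_gnn_layer(nodes, phi, rho):
    """O(N) alternative to pairwise interactions: apply a shared
    transformation phi to every node, aggregate globally by summation,
    then combine each node with the aggregate via rho."""
    agg = [0.0] * len(phi(nodes[0]))
    for x in nodes:
        h = phi(x)
        agg = [a + hi for a, hi in zip(agg, h)]
    return [rho(x, agg) for x in nodes]

# Toy example: phi doubles each scalar feature, rho adds the global sum.
phi = lambda x: [2.0 * x[0]]
rho = lambda x, agg: [x[0] + agg[0]]
print(linear_gnn_layer([[1.0], [2.0], [3.0]], phi, rho))  # [[13.0], [14.0], [15.0]]
```

The cost is N applications of `phi` and `rho` rather than N² pairwise message computations, which is what makes the design fit within trigger-level latency budgets.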
LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Nag, Shashank, Bacellar, Alan T. L., Susskind, Zachary, Jha, Anshul, Liberty, Logan, Sivakumar, Aishwarya, John, Eugene B., Kailas, Krishnan, Lima, Priscila M. V., Yadwadkar, Neeraja J., Franca, Felipe M. G., John, Lizy K.
Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs -- a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic and Look Up Table (LUT) based networks, such as LogicNets, NeuraLUT, and DWN, in offering models that simultaneously reduce both the memory and compute footprints. However, these models natively do not perform well on common vision tasks, such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge-optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization that reveals that a majority of model weights and computations are from the channel mixer (MLP layer), we design an alternate LUT-based channel mixer, and simultaneously develop an FPGA-based accelerator for LL-ViT. Contrary to some attempts to replace each multiplication with a table lookup, our architecture utilizes a neural learning approach which natively learns the LUT functions. This approach allows for reduced model sizes, and a computational and energy-efficient inference solution for vision transformer models. Evaluating on edge-suitable workloads, we achieve accuracies of 95.5% on CIFAR-10, 78.8% on CIFAR-100, and 60.9% on Tiny-ImageNet datasets, comparable to the baseline transformer. LL-ViT eliminates over 60% of the model weights and 50% of the multiplications in the model, and achieves 1.9x higher energy efficiency and 1.3x lower latency than an integer-quantized ViT accelerator, while also offering superior throughput against prior works at a 10.9W power budget.
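A LUT-neuron layer of the kind LL-ViT builds on (in the LogicNets/NeuraLUT/DWN lineage) can be sketched as sparsely connected table lookups: each output reads a small fixed subset of input bits and uses them as an address into its own truth table. The connections and table contents below are toy values, not a trained model's:

```python
def lut_layer(input_bits, connections, tables):
    """LUT layer: output j reads the input bits listed in connections[j]
    and looks the resulting address up in tables[j] -- no multiplications,
    only memory reads."""
    outputs = []
    for conns, table in zip(connections, tables):
        addr = 0
        for i in conns:
            addr = (addr << 1) | input_bits[i]
        outputs.append(table[addr])
    return outputs

# Two 2-input LUT neurons: an AND of bits (0, 2) and an XOR of bits (1, 0).
print(lut_layer([1, 0, 1],
                [[0, 2], [1, 0]],
                [[0, 0, 0, 1], [0, 1, 1, 0]]))  # [1, 1]
```

During training the table entries are learned with gradient-based relaxations; after training, each table maps directly onto FPGA LUT primitives.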
Sub-microsecond Transformers for Jet Tagging on FPGAs
Laatu, Lauri, Sun, Chang, Cox, Arianna, Gandrakota, Abhijith, Maier, Benedikt, Ngadiuba, Jennifer, Que, Zhiqiang, Luk, Wayne, Spiropulu, Maria, Tapper, Alexander
We present the first sub-microsecond transformer implementation on an FPGA achieving competitive performance for state-of-the-art high-energy physics benchmarks. Transformers have shown exceptional performance on multiple tasks in modern machine learning applications, including jet tagging at the CERN Large Hadron Collider (LHC). However, their computational complexity has until now prohibited their use in real-time applications, such as the hardware trigger systems of the collider experiments. In this work, we demonstrate the first application of transformers for jet tagging on FPGAs, achieving $\mathcal{O}(100)$ nanosecond latency with superior performance compared to alternative baseline models. We leverage high-granularity quantization and distributed arithmetic optimization to fit the entire transformer model on a single FPGA, achieving the required throughput and latency. Furthermore, we add multi-head attention and linear attention support to hls4ml, making our work accessible to the broader fast machine learning community. This work advances the next-generation trigger systems for the High Luminosity LHC, enabling the use of transformers for real-time applications in high-energy physics and beyond.
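Distributed arithmetic, which this work (like JEDI-linear above) uses to remove multipliers from multiply-accumulate operations, replaces a dot product against fixed weights with bit-serial table lookups: a table indexed by one bit-slice of all inputs holds precomputed partial sums of the weights. A minimal sketch for unsigned integer inputs:

```python
def da_dot(weights, xs, x_bits):
    """Distributed arithmetic: compute dot(weights, xs) for unsigned
    x_bits-wide integer inputs with no multipliers. The table holds, for
    every bit pattern across inputs, the sum of the selected weights;
    the result is accumulated from shifted table lookups."""
    n = len(weights)
    table = [sum(w for i, w in enumerate(weights) if pattern >> i & 1)
             for pattern in range(1 << n)]
    acc = 0
    for b in range(x_bits):
        # Gather bit b of every input into one table address.
        pattern = 0
        for i, x in enumerate(xs):
            pattern |= ((x >> b) & 1) << i
        acc += table[pattern] << b
    return acc

print(da_dot([3, 5, 2], [4, 1, 7], 3))  # 3*4 + 5*1 + 2*7 = 31
```

In hardware the table is free (it maps onto LUTs), so each MAC becomes a lookup and an add, which is what lets the whole model avoid DSP blocks.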
Bhasha-Rupantarika: Algorithm-Hardware Co-design approach for Multilingual Neural Machine Translation
Lokhande, Mukul, Dewangan, Tanushree, Mansoori, Mohd Sharik, Chaudhari, Tejas, J., Akarsh, Lokhande, Damayanti, Teman, Adam, Vishvakarma, Santosh Kumar
This paper introduces Bhasha-Rupantarika, a light and efficient multilingual translation system tailored through algorithm-hardware codesign for resource-limited settings. The method investigates model deployment at sub-octet precision levels (FP8, INT8, INT4, and FP4), with experimental results indicating a 4.1x reduction in model size (FP4) and a 4.2x speedup in inference speed, which corresponds to an increased throughput of 66 tokens/s (a 4.8x improvement). This underscores the importance of ultra-low-precision quantization for real-time deployment on IoT devices using FPGA accelerators. Our evaluation covers bidirectional translation between Indian and international languages, showcasing its adaptability in low-resource linguistic contexts. The FPGA deployment demonstrated a 1.96x reduction in LUTs and a 1.65x decrease in FFs, resulting in a 2.2x enhancement in throughput compared to OPU and a 4.6x enhancement compared to HPTA. Overall, the evaluation provides a viable solution based on quantization-aware translation along with hardware efficiency suitable for deployable multilingual AI systems. The complete code [https://github.com/mukullokhande99/Bhasha-Rupantarika/] and dataset are publicly available for reproducibility, facilitating rapid integration and further development by researchers.
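A symmetric per-tensor INT4 quantizer is one plausible form of the sub-octet quantization evaluated here (the abstract does not specify the exact scheme, so the following is a hedged sketch rather than the paper's method):

```python
def quantize_int4(weights):
    """Symmetric INT4 quantization: map floats to integers in [-8, 7]
    using a single per-tensor scale derived from the largest magnitude."""
    scale = max(abs(w) for w in weights) / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the INT4 codes."""
    return [v * scale for v in q]

w = [0.9, -0.35, 0.1, -0.7]
q, s = quantize_int4(w)
print(q)  # [7, -3, 1, -5]
```

Each INT4 weight occupies half a byte, which is where the roughly 4x model-size reduction relative to FP16/FP32 storage comes from.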
Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing
Cho, Eunbyeol, Kim, Jiyoun, Lee, Minjae, Park, Sungjin, Choi, Edward
Electronic Health Records (EHR) are time-series relational databases that record patient interactions and medical events over time, serving as a critical resource for healthcare research and applications. However, privacy concerns and regulatory restrictions limit the sharing and utilization of such sensitive data, necessitating the generation of synthetic EHR datasets. Unlike previous EHR synthesis methods, which typically generate medical records consisting of expert-chosen features (e.g. a few vital signs or structured codes only), we introduce RawMed, the first framework to synthesize multi-table, time-series EHR data that closely resembles raw EHRs. Using text-based representation and compression techniques, RawMed captures complex structures and temporal dynamics with minimal preprocessing. We also propose a new evaluation framework for multi-table time-series synthetic EHRs, assessing distributional similarity, inter-table relationships, temporal dynamics, and privacy. Validated on two open-source EHR datasets, RawMed outperforms baseline models in fidelity and utility. The code is available at https://github.com/eunbyeol-cho/RawMed.
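A text-based serialization of multi-table time-series records, in the spirit of (but not identical to) RawMed's representation, might flatten every row of every table into one time-ordered token stream with minimal preprocessing. The field format below is a hypothetical illustration:

```python
def serialize_events(tables):
    """Flatten multi-table EHR rows into one time-ordered text sequence:
    each row becomes 'table|col=val|...', sorted by its timestamp.
    The 'time' key is assumed present in every row (an illustrative
    convention, not RawMed's actual schema)."""
    events = []
    for name, rows in tables.items():
        for row in rows:
            fields = "|".join(f"{k}={v}" for k, v in row.items() if k != "time")
            events.append((row["time"], f"{name}|{fields}"))
    return [text for _, text in sorted(events)]

tables = {
    "labs": [{"time": 2, "test": "hr", "value": 80}],
    "meds": [{"time": 1, "drug": "aspirin"}],
}
print(serialize_events(tables))  # ['meds|drug=aspirin', 'labs|test=hr|value=80']
```

Serializing to text lets a single generative model learn inter-table relationships and temporal ordering without hand-crafted, per-table feature engineering.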
DPUV4E: High-Throughput DPU Architecture Design for CNN on Versal ACAP
Li, Guoyu, Zheng, Pengbo, Weng, Jian, Yang, Enshan
Convolutional Neural Networks (CNNs) remain prevalent in computer vision applications, and FPGAs, known for their flexibility and energy efficiency, have become essential components in heterogeneous acceleration systems. AMD's Versal ACAP architecture, tailored for AI applications, incorporates AI Engines (AIEs) to deliver high computational power. Nevertheless, the platform suffers from insufficient memory bandwidth, hindering full utilization of the AIEs' theoretical performance. We design two computation units, Conv PE and DWC PE, to support different computational patterns. Each computation unit's data flow efficiently exploits data reuse opportunities to mitigate bandwidth bottlenecks. Additionally, we extend the functionality of each PE to utilize AIEs for non-convolutional operations, reducing resource overhead. Experiments on over 50 models show that compared to previous designs, our design provides 8. At present, deep learning (DL) has profoundly integrated into our daily lives. Despite the emergence of new transformer-based neural networks, Convolutional Neural Networks (CNNs) remain extensively employed owing to their proficiency in extracting local information from images in relatively smaller datasets. GPUs' efficient parallel processing is used to improve CNN inference, but their general-purpose design reduces energy efficiency. To improve accelerators' energy efficiency and throughput, custom CNN architectures have been proposed.
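The split between the two computation units can be illustrated with the pattern a DWC PE handles: depthwise convolution filters each channel with its own kernel and performs no cross-channel accumulation, unlike the standard convolution a Conv PE computes. A toy functional sketch (not the accelerator's dataflow):

```python
def depthwise_conv1d(x, kernels):
    """Depthwise convolution: channel c of the input is convolved only
    with kernels[c]; there is no summation across channels (valid
    padding, stride 1)."""
    out = []
    for ch, k in zip(x, kernels):
        row = []
        for i in range(len(ch) - len(k) + 1):
            row.append(sum(ch[i + j] * k[j] for j in range(len(k))))
        out.append(row)
    return out

# One channel, kernel [1, 1]: a sliding pairwise sum.
print(depthwise_conv1d([[1, 2, 3, 4]], [[1, 1]]))  # [[3, 5, 7]]
```

Because depthwise layers have far less arithmetic per byte of data moved than standard convolutions, they stress memory bandwidth differently, which motivates a dedicated PE with its own reuse strategy.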